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Abstract 

Nonsingular estimation of high dimensional covariance matrices is an 
important step in many statistical procedures like classification, cluster- 
ing, variable selection an future extraction. After a review of the essential 
background material, this paper introduces a technique we call slicing for 
obtaining a nonsingular covariance matrix of high dimensional data. Slic- 
ing is essentially assuming that the data has Kronecker delta covariance 
structure. Finally, we discuss the implications of the results in this pa- 
per and provide an example of classification for high dimensional gene 
expression data. 

1 Intoduction 

The advances in data collection methods and the increase in data storage and 
processing capabilities has led to data sets that are not suitable for analysis with 
the classical statistical approaches. For example, through DNA micro array 
techniques, the expression levels of millions of genes can easily be obtained. 
However, usually, the number of observations (the sample size) is much less 
than the number of expression levels observed. This is the characteristic of 
many recent data sets in bioinformatics, signal processing, and many other 
fields of science. The number of variables (p) is much higher than the number 
of observations (N) (i.e., N <<p). 

It is well known that when N < p the usual sample covariance matrix will be 
singular. Many methods in statistics, like clustering and classification depends 
on estimating the inverse of the covariance matrix. For small samples and 
especially when N << p This becomes a major problem when the we need to 
obtain a the inverse of the covariance matrix. 



The technique shcing, which we will discuss in detail in this paper, is essen- 
tially obtaining estimates of the covariance matrix under the assumption that 
assuming that the p-dimensional observations are realizations from a multivari- 
ate distribution with a certain Kronecker delta structure. Slicing is appropriate 
when the number of observations in the sample is much less than the number of 
variables. because by choosing a Kronecker structure for the covariance a great 
deal of decrease in the number of parameters is obtained. By using 2-way, 3- 
way, and in general i-way Kronecker structures for the covariance matrix, we 
can obtain nonsingular estimates of the covariance matrix when N << p. 

While developing slicing, we have used the concept of array variate normal 
variable with multiway Kronecker delta structure obtained by using the rules 
of multi linear algebra. In Section 2, we will first review array algebra as its 
discussed in [9] , [10] , Blaha [3] . The array variate normal model with Kronecker 
delta structure and estimation of its parameters are also discussed in Section 
2. In Section 3, we describe slicing in detail, provide the results from various 
simulations and apply the technique to high dimensional gene expression data. 

2 Array Algebra and Array Variate Normal Ran- 
dom Variable 

2.1 Array Algebra 

In this paper we will only study arrays with real elements. We will write X to 
say that X is an array. When it is necessary we can write the dimensions of 
the array as subindices, e.g., if X is a mi x 7712 x ma x 7714 dimensional array 
in /j"iix™2x...xmi^ then we can write Xm1y.m2y.m3y.m4,- To refer to an element 
of an array Xmjxm2X)n3xm4, we write the position of the element as a subindex 
to the array name in parenthesis, (^)7.^7-2r3r4- 

We will now review some basic principles and techniques of multi linear 
algebra. These results and their proofs can be found in Rauhala [9], [10] and 

E. 

Definition 2.1. Inverse Kronecker product 0/ two matrices A and B of dimen- 
sions p X q and r x s correspondingly is written as A®^ B and is defined as 
A & B = [A(B)jk]pryqs = B I® A^ where represents the ordinary Kronecker 
product. 

The following properties of the inverse Kronecker product are useful: 

• & A = A & ^ 0. 

• {Ai + A2) & B = Ai & B + A2& B. 

• A& {Bi + B2) ^ A & Bi+ A & B2. 

• aA & /3B = a/SA & B. 
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• {Ai & Bi){A2 & B2) = A1A2 & B1B2. 

• {A& B)-^ = {A-^ & B~^). 

• {A & _B)+ = (g)* B^), where is the Moore-Penrose inverse of A. 

• {A & B)^ = {A^ & B^), where A~ is the /-inverse of A defined as 
A- = {A'A)-^A'. 

• ff {Ai} and {fij} are the eigenvalues with the corresponding eigenvectors 
{a;,;} and {Vj} for matrices A and B respectively, then A & B has eigen- 
values {Xijij} with corresponding eigenvectors {xi & y^}. 

• Given two matrices A„xn and B„ixni \A& B\ = tr(A& B) = 
tr{A)tr{B). 

• A& B = B ® A = UiA^ BU2, for some permutation matrices Ui and C/2. 
It is well known that a matrix equation 

AXB' = C 
can be rewritten in its mono linear form as 

A& Bvec{X) = vec{C). (1) 
Furthermore, the matrix equality 

A (g)' BXC = E 

obtained by stacking equations of the form ([T]) can be written in its mono linear 
form as 

{A (g)* B & C)vec{X) = vec{E). 

This process of stacking equations could be continued and R-matrix multiplica- 
tion operation introduced by Rauhala [W' provides a compact way of representing 
these equations in array form: 

Definition 2.2. R-Matrix Multiplication is defined element wise: 
{{A,)\A2)\..{A,Y 

mi rn2 

= ^(^l)qiri ^ (^2)q2r2 (^3)93^3 ■ ■ ■ (^Ogin (^)rir2...ri ■ 

ri=l r2— 1 ^3 — 1 Ti — l 

R-Matrix multiplication generalizes the matrix multiplication (array multi- 
plication in two dimensions)to the case of fc-dimensional arrays. The following 
useful properties of the R-Matrix multiplication are reviewed by Blaha j3j : 

• {AfB = AB. 
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. {A^f{A2)^C = A^CA'^. 

• Y = {lf{lf ...{ifY. 

• ((Ai)i(A2)2 . . . (A0")((Bi)'(B2)' . . . {B^y)Y = {AiBry{A2B2f . . . {AAYY. 

The operator rvec describes the relationship between ^mixm2x...mi and its 
mono hnear form Xm^m2...mi^i- 

Definition 2.3. rvec{Xm-^^^„i2^...mi) — Xmim2...miXi where x is the column 
vector obtained by stacking the elements of the array X in the order of its 
dimensions; i.e., (^)jij2---ii — where j = (ji — l)ni-ini-2 . . .ni + {ji — 

2)nj_2»^<-3 . . . rii + . . . + (j2 - 1)"-! + .ii- 

Let imixm2x...mi — (^i)"'^(^2)^ • • • {AiY X wherc {AjY is an nij x uj matrix 
for j = 1,2, ... ,i and X is an ni x ^2 x . . . x array. Write 1 = rvec{L) and 
X = rvec{X). Then, I = Ai& A2 & . . .& AiX. Therefore, there is an equivalent 
expression of the array equation in mono linear form. 

Definition 2.4. The square norm of Xmixm2x...mi is defined as 

mi m2 rrii 

ji=ii2=i ji=i 

Definition 2.5. The distance o/Xi„j^xm2x...mi f™''^ ^2mixm2x...mi is defined 
as 

'\Z-X2\\^. 

Example 2.1. Let Y ^ {Ai^ {A2y . . . {A,y X + E . Then is minimized 

forX = {A^)\A^)\..{AryY. 

2.2 Array Variate Normal Distribution 

Definition 2.6. (Ill) Let Ai, A2, . . . , Ai are non singular matrices of orders 
TOi,m2, . . . ,mi and let M be an miX TO2 x . . . xm^ dimensional constant ar- 
ray. Then the pdf of array normal random variable X with Kronecker delta 
covariance structure is given by 

exp (-U\(A-^y(A-^f . . . (Ary(X - M)f) 

(2) 

Distributional properties of a array normal variable with density in the form 
of Theorem |2.6| can obtained by using the equivalent mono linear representa- 
tion. The moments, the marginal and conditional distributions, independence 
of variates should be studied considering the equivalent mono linear form of the 
array variable and the well known properties of the multivariate normal random 
variable. 



4 



Definition 2.7. For the mi x 777,2 x . . . x dimensional array variate random 
variable X, the principal components are defined as the principal components of 
the d = 7T7i7T72 . . . mi-dimcnsional random vector rvec{X). 

The main statistical problem is the estimation of the covariance of rvec{X), 
its eigenvectors and eigenvalues for small sample sizes. 



2.3 Estimation 

In this section we provide an heuristic method of estimating the model param- 
eters. The optimality of these estimators are not proven but merely checked by 
simulation studies. Inference about the parameters of the model in Theorem |2.6 
for the matrix variate case has been considered in the statistical literature 
[T^ . [H],[Ilj, etc.). In these papers, the unique maximum likelihood estimators 



of the parameters of the model in Theorem 2.6 for the matrix variate case are 
obtained under different assumptions for the covariance parameters. Some clas- 
sification rules based on the matrix variate observations with Kronecker delta 
covariance structures have been studied in |13j . and also in [6 . 



The model in Theorem 2.6 the way it is stated is unidentifiable. However, 
this problem can easily be resolved by putting restrictions on the covariance 
parameters. The approach we take is to assume that j — 1 of the last diagonal 
elements of matrices AjA'^ are equal to 1 for j — 1,2, . . . ,i. The Flip-Flop Algo- 
rithm is proven to attain the maximum likelihood estimators of the parameters 
of two dimensional array variate normal distribution |14| . 

The following is similar to the flip flop algorithm. First, assume {Xi, X2, ■ ■ ■ , 
Xn} is a random sample from a N{M, Ai, A2, . . . Ai) distribution with j — 1 
of the last diagonal elements of matrices AjA'^ equal to 1 for j = 1,2,..., i. 
Further, we assume that all A'^s are square positive definite matrices of rank at 

least j. Finally, assume that we have N Y\j=i ™i > "7,^ for all r = 1, 2, . . . , i. 
Algorithm for estimation: 

1. Estimate M by M — J2iLi ^^^d obtain the centered array observa- 
tions Xj; = Xi -M ioi I ^1,2,... , N. 

2. Start with initial estimates of ^2, ^3, • ■ • , ^i- 

3. On the basis of the estimates oi A2, A^, . . . , Ai calculate an estimate of Ai 
by first scaling the array observations using 

Z,^ilYiA^'r,iA^'f,...,iA-yx^, 

and then calculating the square root of covariance along the 1st dimension 
of the arrays Zi, I ^ 1,2, N. 

4. On the basis of the most recent estimates of the model parameters, esti- 
mate Aj j ~ 2, . . . , i. by first scaling the array observations using 

= (A^'YiA^'f, . . . {A-l,y-'i{Ajl,y+' . . . {A-'yx^, 
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and then calculating the square root of covariance along the jth dimension 
of the arrays Z;'s for j = 2, . . . ,i. Scale the estimate of AjA'j so that the 
last j — 1 diagonal elements are equal to 1. 

5. Repeat steps 3 and 4 until convergence is attained. 

Let Xi, I = 1,2,..,N be a random sample for the array variate random 
variable X. Let p = mim2 ■ ■ .mi. When iV < p, it is well known that the 
usual covariance estimator for rvec{X) will be singular with probability one. 
Therefore, when N < p, there is no consistent estimator of the covariance of 
rvec{X) under the unstructured covariance assumption. 

On the other hand, if we assume that the covariance matrix has Kronecker 
delta structure, we can obtain a nonsingular estimate of the covariance structure 
with the methods developed in this section. The condition on the sample size is 
relaxed considerably. If we havepiV > for r = 1, 2, . . . , i and the assumptions 
stated before the algorithm for estimation of the parameters of this model hold, 
then the estimator of the covariance matrix is nonsingular. When the covariance 
does not have Kronecker structure, the estimate obtained here could be used as 
regularized nonsingular estimate of the covariance. 

Example 2.2. Let 



.1/2 / 3 -1 X'''' 

Ai={ , t ] , ^2 = 2 



1 2 



-1 1 



and 



4 1 \ ^''^ 



As = 1 . 
V 1 1 / 

Also, let M be the array of dimensions 2x3x3. The following are the 
estimates of Ai, A2 and A3 based on a random sample of size 100 from the 
N{AuA2,A3,M). 

. X 1/2 / 4.69 0.25 -0.52 \ 

Ai = ( i;i3 I ,A2={ 0.25 2.70 -0.04 | , 



-0.52 -0.04 



and 



/ 4.27 -0.13 0.55 \ ^ 

A3 = -0.13 1 0.02 

\ 0.55 0.02 1 / 

The left plot in Figure [2| compares the estimated eigenvalues to the true eigenvalues 
for this example. 



6 



3 Slicing 



A vector x of dimension p can be sliced into p/mi ~ m2 pieces and organized 
into a matrix of order p = mi x m2 for some natural numbers mi and m2. 
Or, in general, the same vector can be organized in an array of dimension 
p — rui X TO2 X ... X mi for some natural numbers mi, m2, . . . , m,j. Once we 
slice the data and reorganize it in array form, we can pretend that this array data 



was generated from the model in Theorem 2.6 We require that the additional 
assumptions stated before the algorithm for estimation of the parameters of 
this model hold. A nonsingular estimate of the covariance matrix A of the 
p— dimensional vector variate random variable can be obtained by using the 
estimators from this algorithm and using A = {Ai & A2 & Ai){Ai & A2 & Ai)' 

That we do not have to assume any covariance components are zero is the 
main difference and advantage of this regularization method to the usual shrink- 
age methods like lasso [4]. 

If {\{Ar)rj} are the rrij eigenvalues of Aj.A'^ with the corresponding eigen- 
vectors {{Xr)rj} for r — 1, 2, . . . , i and Vj — 1,2,..., m^, then {Ai & A2 & 
^0(^1 ^2 & AiY will have eigenvalues {X{Ai)r^X{A2)r2 ■■■X{Ai)rJ with 
corresponding eigenvectors {(a^i)^ & {x2)r2 & ■ ■ ■ & {xi)^}- By replacing Ar 
by their estimators, we estimate the eigenvalues and eigenvectors of the covari- 
ance of rvec{X) using this relationship. Since each eigenvector is a Kronecker 
product of smaller components the reduction in dimension obtained by this ap- 
proach is larger than the one that could be obtained using the ordinary principle 
components on the ordinary sample covariance matrix. 

Example 3.1. Let x ^ Ni2{fi = 0, A).. We illustrate slicing for i — 2, mi — 3 
and m2 = 4. N = 5, 10, 15,20, ...,45,50 sets of N observations were generated 
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and A was estimated using A — Ai & A2 assuming the model in Theorem 
We repeated the whole experiment 5 times. The results are summarized in Figure 
[7} The covariance matrix in the left figure is the identity matrix. In the center 
figure we have A that has the same order Kronecker delta covariance structure as 
the slicing, the components of A are unstructured and generated randomly. The 
right figure is the case where A is a randomly generated unstructured covariance 
matrix. Slicing has as a regularization effect that shrinks the eigenvalues towards 
each other. 

Example 3.2. The Alon colon data set ^21 have expression measurements on 
2000 genes and ni = 40 tumor tissues and 712 — 22 normal tissue samples. We 
will compare the means of the normal and tumor tissue samples. We assume 
first that normal and tumor tissues have the same covariance A, a 2000 x 2000 
positive definite matrix. We slice each of the n = 62 observations into a 40 x 50 
matrix and estimate A with A = Ai & A2 assuming the model in Theorem 



2.6 holds. For testing the equality of the means, we calculate the F^ statistic 



proposed in ^ replacing their estimator of the inverse of covariance matrix A 
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Identity 



Kronecker 



Unstructured 




Figure 1: The true covariance structure is represented with the black hue. The 
estimated covariance structure is denoted by different colors according to the 
sample size. Yellow colors are for small sample sizes, red for moderate sample 
sizes and green for the larger samples. The covariance matrix in the left figure is 
the identity matrix. As N gets larger the estimators of the eigenvalues approach 
the true values. In this case the estimator seem to be consistent. In the center 
figure A has the same order Kronecker delta covariance structure as the slicing, 
each of the components of A are unstructured and generated randomly. The 
right figure is the case where A is a randomly generated unstructured covariance 
matrix. In these last two cases the estimator has a bias that does not decrease on 
the average with increasing sample sizes, however the variance of the estimator 
decreases as the sample size increases. 



8 



with the inverse of A: 



Using the sampling distribution Fr,n-r proposed in ^ assuming that the rank 
r of A is 62 — 1, the p-value is calculated as 0.01445. Thus, the hypothesis of 
equality of the means is rejected. 

Example 3.3. = 10 i.i.d. observations from a Ni2{fJ. — 0, A), distribution are 
generated for a randomly generated unstructured nonsingular covariance matrix 
A. The right plot in Figure^ compares the estimated eigenvalues obtained by 
slicing this data into a 2 x 3 x 3 array with the ordinary sample covariance. 



True and Estimated Eigenvalues 

True Ejnd Estlmalefl Eigenvalues 




10 IS ^ '° 

Index Index 



Figure 2: The left plot in Figure [2] compares the estimated eigenvalues to the 
true eigenvalues (Example 2.2). The right plot in Figure [2] compares the esti- 
mated eigenvalues obtained under different assumptions to the true eigenvalues 
(Example 3.3). The red *'s represent the true values, black o's are for estimates 
under Kronecker delta covariance assumption and the blue -|-'s are for estimates 
under unrestricted covariance assumption. 



Example 3.4. In this example, we will use the heatmap of the true and es- 
timated covariance matrices under different scenarios to see that slicing gives 
a reasonable description of the variable variances and covariances. In Figure 
the true covariance matrix is a 120 x 120 identity matrix, we estimate this 
covariance matrix for N = 10,50, and 100 independent sets of random samples 
by using 15 x 8 slicing. In Figure^the true covariance matrix is a 120 x 120 
block diagonal matrix with Kronecker delta structure. Finally, in Figure the 
true covariance is a matrix with 4 way Kronecker structure. Convergence of the 
estimators is observed even when p » N. 

Example 3.5. We have used the Fisher's linear discriminant analysis for the 
Alon colon data set 12]. The linear discriminant function was calculated using 
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True Covariancs 
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Figure 3: The true covariance matrix is a 120 x 120 identity matrix with Kro- 
necker delta structure. We estimate this covariance matrix for N — 10, 50, and 
100 independent sets of random samples by using 15 x 8 slicing. 






Figure 4: The true covariance matrix is a 120 x 120 block diagonal matrix with 
Kronecker delta structure. We estimate this covariance matrix for N = 10, 50, 
and 100 independent sets of random samples by using 15 x 8 slicing. 
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N=50 N«1M 
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Figure 5: The true covariance matrix is 120 x 120 4-way Kronecker delta struc- 
tured matrix. We estimate this covariance matrix for N — 10, 50, and 100 
independent sets of random samples by using 15 x 8 slicing. 



A 



n 



w = A~^(xi — X2) where A is the covariance estimate from Example 3.2 
observation x was classified as "normal" if x'w > 0, otherwise as "tumor". 
Figure^ summarizes our findings. Misclassification rate is %11.3. 

In practice, how slicing is done matters. For example, a 24 dimensional vec- 
tor could be sliced as 2 x 12, 3 x 8, 4 x 6, or 2 x 3 x 4, etc. In addition, the 
permutation of the variables will effect the estimators. As was discussed ear- 
lier slicing obtains dimension reduction by writing the covariance matrix into 
separable components and we perceive that more parsimonious models can be 
obtained by, for example, proposing a reduced rank mean for the array variable 
obtained after slicing. Yet another direction would be estimating each compo- 
nent of the covariance structure sparsely by using a penalty approach like the 
one used in [3] . These issues and improvements are important and will be dealt 
with in detail in a different article. In the following,we will use the GLASSO 
j5! package which implements the shrinkage estimator of covariance |4j matrices 
will be used in conjunction with the flip flop algorithm. In practice, each of 
the components of the covariance structure could be penalized to individually 
to obtain very sparse nonsingular covariance estimates. This is important for 
variable selection. 

Example 3.6. (Sparse Slicing with GLASSO:) 

We insert the GLASSO algorithm of /^/ at the 4th step of the estimation 
algorithm from Section 2.3, just before scaling of the matrix. The heatmap of 
the estimated correlation matrix for the first 500 components of the Alon colon 
data set obtained by using two way slicing (2Q x 25) and applying GLASSO to 
the components are given in Figure The shrinkage parameters for GLASSO 
should selected by the aid of a model selection technique. Here, the values of these 
parameters are identified tentatively. The expression levels in this dataset were 
ordered with respect to their variances. For the samples of expression levels from 
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Linear Discrimination for Alon Cancer Data 
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Figure 6: Linear discriminant analysis for the Alon colon data set [2 . An 
observation x was classified as "normal" if x'w > 0, otherwise as "tumor". 
Misclassification rate is %11.3. 



normal and tumor tissues, high correlation values (lighter colors in the heatmap ) 
are only observed for the expression levels that have high variance. Low variance 
components have little correlation among each other but they might be mildly 
correlated with the high variance expression levels. The linear discrimination 
of the groups on the 500 high variance expression levels result in 12.9% false 
classification rate. 
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Figure 7: Slicing with GLASSO: The expression levels in this dataset were 
ordered with respect to their variances. For the samples of expression levels 
from normal and tumor tissues, high correlation values (lighter colors in the 
heatmap) are only observed for the expression levels that have high variance. 
Low variance components have little correlation among each other but they 
might be mildly correlated with the high variance expression levels. 
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